Abstract. We address the problem of unsupervised soft clustering that
we call micro-clustering. The aim of the problem is to enumerate all
groups composed of records strongly related to each other, whereas standard
clustering methods find boundaries where records are few. Existing
methods have several weak points: they generate an intractable number of
clusters, yield a heavily biased size distribution, lack robustness, and so on.
We propose a new methodology called data polishing. Data polishing clarifies
the cluster structures in the data by perturbing the data according to a
feasible hypothesis. In detail, for graph clustering problems, data polishing
replaces dense subgraphs that would correspond to clusters by cliques, and
deletes edges not included in any dense subgraph. The clusters are clarified as
maximal cliques, and are thus easy to find, and the number of maximal cliques
is reduced to a tractable number. We also propose an efficient algorithm
so that the computation finishes in a few minutes even for large-scale data.
The computational experiments demonstrate the efficiency of our formulation
and algorithm, i.e., the number of solutions is small, such as 1,000,
the members of each group are deeply related, and the computation time
is short.


1 Introduction 

Unsupervised clustering is one of the central tasks in data analysis. It is used 
in various areas and has many applications, such as Web access log analysis, 
recommendation, herding, and mobility analysis. It is actually a data mining 
task since we can hopefully find some groups that are not known, or not easy to 
explain from background knowledge. Usually, unsupervised clustering algorithms 
aim to partition the data into several groups, or to obtain distributions of group 
members. However, this approach is often not efficient for big data with large 
diversity, since many elements may belong to several or no groups. Another 
approach is to find groups of elements that are deeply related to each other. 
Examples are community mining in social networks and biclustering by pattern 
mining. We here call the groups of elements deeply related to each other
micro-clusters, and the problem of mining micro-clusters micro-clustering.


Modeling and problem formulation of micro-clustering are non-trivial. For
example, consider a similarity graph of the data in which the vertices are
elements, and two vertices are connected if the corresponding two elements are
similar or strongly related. A clique (complete subgraph) of a similarity graph
is a good candidate for a micro-cluster, thus we are motivated to enumerate all
maximal cliques in the graph, where a maximal clique is a clique included in
no other clique.¹ Community mining often uses maximal clique enumeration in
social networks, in which the cliques are considered as cores of the communities.
However, similarity graphs usually include a huge number of maximal cliques due
to ambiguities, thus maximal clique enumeration is often intractable in practice.
Some propagation methods [6, 11] extract structures that can be seen as
micro-clusters. In this case, completeness becomes a difficult issue. Repetitive
executions of these methods yield many similar groups corresponding to one
group, and often miss some weak groups that are hidden by intensive groups.
Global optimization approaches such as modularity maximization [i] or graph
cut [7] do not fit micro-clustering well either, since they tend to favor large
clusters. To the best of our knowledge, there is no efficient method for
micro-clustering.

One of the difficulties of micro-clustering is the problem formulation; the
clusters in the data are hard to describe only in mathematical terms. "Enumeration
of maximal cliques" involves numerous solutions, "modularity maximization"
gives incentives to enlarge large clusters much more, and "finding local dense
structures by propagation (random walks)" hides small weak clusters surrounding
a dense and relatively large cluster. The form of clusters differs among kinds
of data, problems, and objectives, so the formulation is naturally difficult,
and there may be no general answer. On the other hand, we know that the results
of clustering algorithms are often good, and valuable for machine learning,
visualization, natural language processing, etc. This fact implies that
computing good clusters might be easier than modeling the clusters in general
terms. According to Vapnik's principle, solving a more difficult problem as a
step toward solving a relatively easier one is not a good approach, and
formulating clusters mathematically in order to compute them may be such a case.
It is worth considering how to obtain good clusters without a mathematically
written problem formulation.

In this paper, instead of solving the clustering problem directly, we address
the problem of eliminating the ambiguity to make the clustering easy. We propose
a new concept called data polishing, which clarifies the structures to be found
in the data by modifying the data according to a feasible hypothesis. For
example, it modifies the input graph so that similar structures (groups,
patterns, segments, etc.) considered to be the same (having the same meaning)
are unified into one. The goal of data polishing is a one-to-one correspondence
between the objects (meanings) we want to find and the structures modeling the
objects. The difference from data cleaning is that data cleaning usually
corrects each entity of the data to delete noise, and thus does not aim to
change the solutions of the original data, while data polishing actively
modifies the data so that the solutions change. For example, consider
community mining by finding cliques of size k. A typical data cleaning deletes
vertices of degree less than k - 1, so that no clique of size k is missed. In
a data polishing way, we consider that a community is a pseudo clique (dense
subgraph) in the graph, and thus replace pseudo cliques by cliques. This
drastically changes the result of the clique mining. A community in an SNS
network is a dense subgraph, but it usually has many missing edges, thus it is
not a clique. These missing edges are actually "truth" and not noise, so from
the data cleaning point of view it is difficult to recognize them as pairs that
should be connected. However, a community is preferred to be a clique, in the
sense of the motivation for modeling communities. Data polishing modifies the
data so that the structures we want to find become what we think they should be,
at the cost of allowing individual records to be altered. We observe that when
the graph is clear, in the sense that it is composed only of cliques
corresponding to communities, the mining results will be fine. This is the
origin of the idea of data polishing.

¹ Maximal cliques and maximum cliques are different; a maximum clique is a
clique of the maximum size.

The enumeration of pseudo cliques is actually a hard task; graphs often
include many more maximal pseudo cliques than maximal cliques [5D]. Instead
of enumerating pseudo cliques, we use a feasible hypothesis: two vertices
have many common neighbors in the graph if they are included in a dense
subgraph of a certain size (see Figure 1). We can also use a similarity of
neighbors instead of the number of common neighbors. It is possible that two
vertices are included in a dense subgraph but have few common neighbors, for
example in a graph composed of a vertex v and a clique. In such cases, v and the
clique should not be in the same micro-cluster, thus we think this hypothesis is
acceptable. We find all vertex pairs having at least a certain number of common
neighbors (or whose similarity of neighbors is no less than the given
threshold), and connect each pair by an edge. On the other hand, we delete all
edges whose endpoints do not satisfy the condition, since they are considered
as not being in the same cluster. We repeat this operation until the graph no
longer changes, and obtain a "polished" graph.

In summary, our micro-clustering algorithm is composed of three parts, (1) 
construction of the similarity graph, (2) data polishing, and (3) maximal clique 
enumeration. As the number of maximal cliques becomes small through data 
polishing, (3) is not a difficult task. The computational cost of (1) and (2) de¬ 
pends on the choice of the similarity measure and the definition of data polishing. 
Usually these tasks are not heavy because the data and graph are usually sparse. 
Our computational experiments show the efficiency of our data polishing algo¬ 
rithm. For example, it reduces the number of maximal cliques in a social network 
from 33,000 to 300 and in a similarity graph of news articles from over 55 million 
to 100,000. The contents of the cluster are acceptable and understandable, at 
least in our computational experiments. The detection accuracy of clusters for
randomly generated data is significantly better than that of existing
algorithms. In our experiments, both the quality and the speed of our algorithm
exceed those of the existing algorithms.

The organization of this paper is as follows. The next section introduces
the notations. In Section 3, we describe our model of data polishing for
micro-clustering in a graph, and describe a fast algorithm for data polishing
and similarity graph construction in Section 4. Section 5 explains an algorithm
for maximal clique enumeration. In Section 6, we show the results of
computational experiments, and discuss the results in Section 7. We conclude the
paper in Section 8.

Fig. 1. Two vertices (black) in a dense graph have many (4) common neighbors
(gray)

1.1 Related Works 

There are many unsupervised clustering algorithms. Typical algorithms parti¬ 
tion data into two or several groups by learning the boundaries of the groups. 

A typical approach to obtain many clusters is to recursively apply these cluster¬ 
ing algorithms. However, in the top levels of the partition, finding a boundary 
composed of boundaries of many clusters is very difficult, and micro-clusters are 
often broken by global cuts. 

For finding clusters in a similarity graph, the Girvan-Newman clustering
algorithm [4] and its many variants are often used. The algorithms partition the
data into many groups according to the modularity, i.e., a model of clustering
quality. However, they often produce a few very large groups that are mixtures
of many micro-clusters, together with many quite small groups. We observed this
in our computational experiments.

Some random walk algorithms and propagation algorithms [6, 11] find a cluster
from a seed vertex. The obtained cluster can be seen as a main cluster among
those to which the seed vertex belongs. For enumeration, we often apply these
algorithms to all vertices individually. This results in many quite similar
groups that correspond to one cluster; on the other hand, weak clusters hidden
by strong clusters may be missed. Moreover, since even one execution of these
algorithms takes a considerably long time, the total computation time becomes
huge.

Some researchers attempted to enumerate maximal cliques from graphs. For 
example, Kumar et al. enumerated bipartite maximal cliques from Web link 
graphs]^. However, the number of solutions is usually so huge. In our experi¬ 
ments, the number of maximal cliques in a similarity graph from Reuters news 
articles was above 50,000,000. There is currently no algorithm that drastically
reduces the number of maximal cliques.










An idea to remove edges from the network so that the extraction of clusters
becomes easy has been proposed by Satuluri et al., and is called local
sparsification. The way of identifying the edges to be removed from the network
is the same as in our method, but they neither add edges nor repeat the process,
thus the result would be quite different. A similar idea also exists in the
stochastic flow clustering method [16]. The idea is to update the distance
matrix iteratively so that the items belonging to the same cluster come to have
distance zero. Compared to this, our method can be considered as a discrete
version that allows scalable algorithms applicable to large-scale graph data.

2 Preliminary 

A database D is a collection of records T_1, ..., T_n. A record can be any kind
of structure, but we restrict the database such that each record belongs to the
same class of structures, such as itemsets, sequences, and labeled graphs. For a
similarity measure sim, the similarity graph G_sim(D) of the database D is the
graph whose vertex set is the set of records of D, and in which an edge connects
two records u and v if and only if sim(u, v) >= θ. We simply write the
similarity graph as G = (V, E) and assume that V = {1, ..., n} if there is no
confusion. A clique is a set of vertices such that every pair of vertices is
connected by an edge. Cliques are usually defined as subgraphs, but in this
paper we use this definition for conciseness. A clique included in no other
clique is called maximal. The size of a clique is the number of its vertices.

A pair of vertices not connected by an edge is called a non-edge. The density of
a graph G = (V, E) is the ratio of edge existences, i.e., $|E| / (|V|(|V|-1)/2)$.
The density of a vertex set U is defined by the density of the subgraph
G[U] = (U, E[U]), where E[U] is the set of edges connecting two vertices in U.
A pseudo clique is a vertex set having sufficiently large density, such as one
with density no less than the threshold δ. We call a vertex set of density δ a
δ-pseudo clique.
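The density definition above can be sketched in a few lines of Python. This is only an illustration: the function name `density` and the edge representation (a collection of undirected vertex pairs, each listed once) are assumptions made here, not notation from the paper.

```python
def density(vertices, edges):
    """Density of the subgraph induced by `vertices`.

    `edges` is assumed to list each undirected edge once as a pair (u, v).
    """
    vs = set(vertices)
    n = len(vs)
    if n < 2:
        return 0.0  # no vertex pair exists, so we define the density as 0
    inside = sum(1 for u, v in edges if u != v and u in vs and v in vs)
    return inside / (n * (n - 1) / 2)


# K4 with the edge (0, 1) missing: 5 of the 6 possible pairs are edges.
edges = [(0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(density({0, 1, 2, 3}, edges))
print(density({0, 2, 3}, edges))
```

With a threshold δ, a vertex set U would be treated as a pseudo clique when `density(U, edges) >= delta`.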

For a vertex v in G, a vertex adjacent to v is called a neighbor of v. The set
of neighbors of v is denoted by N(v). N[v] denotes N(v) ∪ {v} and is called
the closed neighbor. A vertex w is a common neighbor of vertices u and v if w
is adjacent to both u and v. The degree of a vertex v is the number of vertices
adjacent to v, and is denoted by d(v). For a subgraph K of G, the degree of v in
K is the number of vertices in K that are adjacent to v, and is denoted by
d_K(v). The minimum degree in K is $\min_{v \in K} d_K(v)$.


3 Data Polishing for Micro-Clustering 

We first state our micro-clustering problem from a conceptual point of view, and
propose our data polishing algorithms to clarify the clusters. We also state
some mathematical properties of our algorithms. Note that the statements of
these properties concern the worst cases, and thus do not describe the average
cases. Micro-clusters are groups of data records that are similar or related to
each other, and are desired to have one meaning, or to correspond to one group.
Considering the applications of micro-clustering, micro-clusters should satisfy
the following conditions.

1 quantity (the number of micro-clusters found should not be huge) 

2 independence (micro-clusters should not be similar to each other) 

3 coverage (all micro-clusters should be found) 

4 granularity (the granularity of micro-clusters should be the same) 

5 rigidity (the micro-clusters found should not change due to un-essential 
changes such as random seeds or indices of records) 

From conditions 3 and 4, we approach the micro-clustering problem by structural
enumeration, since structures such as trees and cliques are considered to
correspond to micro-clusters with similar granularity. Condition 5 leads us to
avoid randomness and any computation that depends on the vertex ID ordering.
For the remaining conditions 1 and 2, we observe that the ambiguity in the data
makes algorithms violate the conditions; noise drastically increases the number
of structures to be enumerated, and the structures are too similar and thereby
not independent. In a graph, micro-clusters are considered to correspond to
dense subgraphs, and the non-edges in the dense subgraphs are ambiguity. We
consider that edges included in no cluster are also ambiguity. The concept of
data polishing for micro-clustering comes from these observations: add edges
for these non-edges, and remove these edges from the graph. For identifying
these non-edges and edges, we consider the following feasible hypothesis.

If vertices u and v are in the same clique of size k, u and v have at least k - 2
common neighbors. Thus, we have |N[u] ∩ N[v]| >= k, and this is a necessary
condition for u and v to be in a clique of size at least k. We call this
condition the k-common neighbor condition. If u and v are in a sufficiently
large pseudo clique, they are also expected to satisfy this condition. For
example, the average degree in a pseudo clique of size 2k whose edge density is
80% is 1.6k, thus any two of its vertices satisfy the condition with high
probability. On the contrary, if two vertices do not satisfy the condition,
they belong to a pseudo clique with very small probability. Even if they belong
to a pseudo clique, they actually seem to be disconnected in the clique, thus we
may consider that they should not be in the same cluster. Let P^k(G) = (V, E'),
where E' is the collection of edges connecting vertex pairs satisfying the
k-common neighbor condition; the polishing process is the computation of P^k(G)
from G. We call this process k-intersection polishing.
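One step of k-intersection polishing, as defined above, can be sketched as follows. This is a minimal quadratic-time illustration, assuming the graph is given as a dictionary mapping each vertex to the set of its neighbors; the name `k_intersection_polish` is hypothetical, and a faster way of computing the intersection sizes is the subject of Section 4.

```python
from itertools import combinations


def k_intersection_polish(adj, k):
    """Return P^k(G): connect u, v exactly when |N[u] ∩ N[v]| >= k."""
    closed = {v: set(nbrs) | {v} for v, nbrs in adj.items()}  # closed neighbors N[v]
    polished = {v: set() for v in adj}
    for u, v in combinations(sorted(adj), 2):
        if len(closed[u] & closed[v]) >= k:
            polished[u].add(v)
            polished[v].add(u)
    return polished


# K5 on {0,...,4} with the edge (0, 1) missing, plus a pendant vertex 5 attached to 4.
adj = {
    0: {2, 3, 4}, 1: {2, 3, 4}, 2: {0, 1, 3, 4},
    3: {0, 1, 2, 4}, 4: {0, 1, 2, 3, 5}, 5: {4},
}
polished = k_intersection_polish(adj, 3)
print(polished[0])  # the non-edge (0, 1) is filled in
print(polished[5])  # the edge (4, 5) is removed
```

In this small example the pseudo clique {0, ..., 4} becomes a clique and the pendant edge disappears, which is the behavior the hypothesis aims for.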

The following property assures that we can find pseudo cliques as cliques in 
the processed graph. 

Property 1. A vertex set $K \subseteq V$ of $\gamma k$ vertices is a clique in
$P^k(G)$ when the minimum degree of $K$ in $G$ is at least $(\gamma + 1)k/2$.

Proof. Any two vertices u and v can have at most
$2 \times (\gamma k - 1 - (\gamma + 1)k/2) = \gamma k - 2 - k$
non-common neighbors. Thus, $|N[u] \cap N[v]| \ge k$. □

Note that the density of K is always no less than $(\gamma + 1)/2\gamma$.



Theorem 1. Let K be a $\delta$-pseudo clique ($\delta < 1$) in G of size
$\gamma k$ where the degree of each node is at least $(\gamma + 1)k/s - 1$ for
$\gamma > 1$ and $s > 2$. The density of K in $P^k(G)$ is more than that in G
if $1 - \delta < \frac{(s-1)(\gamma-1)^2}{s^2\gamma^2}$.

Proof. Partition the vertices of K into

$P = \{ v \in K \mid (\gamma+1)k/s \le d_K(v) < (\gamma+1)k/2 \}$,

$Q = \{ v \in K \mid (\gamma+1)k/2 \le d_K(v) < (s-1)(\gamma+1)k/s \}$, and

$R = \{ v \in K \mid (s-1)(\gamma+1)k/s \le d_K(v) \}$.


Let p =\P\ and q = |Q|. Concerning the numbers M and M' of non-edges in K 
and K in P^{G), respectively, we have the following: 




^ > 2 i 


(7 -I- l)k 


p-\- [^k — 


(s — 1)(7 -I- l)k 






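From Eq. (3), we have $q < \frac{2sM}{(\gamma-1)k} - \frac{s}{2}p$. Together
with Eq. (4),

$$M' < p\left(\frac{2sM}{(\gamma-1)k} - \frac{s-1}{2}p\right)
= -\frac{s-1}{2}\left(p - \frac{2sM}{(s-1)(\gamma-1)k}\right)^2
+ \frac{2(sM)^2}{(s-1)((\gamma-1)k)^2}
\le \frac{2s^2M^2}{(s-1)(\gamma-1)^2k^2}.$$

Therefore, we have $M'/M < 1$ from this bound and
$1 - \delta < \frac{(s-1)(\gamma-1)^2}{s^2\gamma^2}$. □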

Corollary 1. Let K be a $8/9$-pseudo clique of $\gamma k$ vertices whose minimum
degree is at least $(\gamma + 1)k/3$. The density of K always increases by
k-intersection polishing if $\gamma > 2 + \sqrt{2} \approx 3.4142$.


From the theorem and corollary, we see that any sufficiently large pseudo
clique with sufficiently large density increases its density in $P^k(G)$. Since
the condition of the theorem is far from tight, we can expect that the densities
of smaller subgraphs also increase in real world data. Hence, by repeatedly
computing $P^k(G)$, $P^k(P^k(G))$ and so on, we hope that the graph will
converge to a graph $P^{k*}(G)$ satisfying $P^{k*}(G) = P^k(P^{k*}(G))$.
However, there is a counterexample, thus we choose a number $\tau$ and repeat
the computation at most $\tau$ times.

Fig. 2. Example of a graph for which k-intersection polishing does not converge

Theorem 2. There is a graph $G' = (V', E')$ such that $P^k(G'') \ne G''$ holds
for any $G'' = P^k(P^k(\cdots P^k(G') \cdots))$ for $k > 4$.

Proof. We describe the example shown in Figure 2. In the figure, B and C are
each a set of k - 2 vertices such that ({a}, B) and (B, C) induce complete
bipartite graphs, where no vertex pair in B or in C is connected. For each
u ∈ B and v ∈ C, we put a clique of size k + 2 so that the clique includes u and
v but no other vertex in the graph. This is to make u and v always have at least
k + 2 common neighbors. For each v ∈ B ∪ C, we prepare vertices w_1 and w_2, and
put two cliques of size k + 2 including only (a, w_i) and (w_i, v) in the same
manner. This is to increase the number of common neighbors of a and v by two.

Let us consider |N[u] ∩ N[v]| for each vertex pair (u, v). For each edge (u, v)
in an attached clique, |N[u] ∩ N[v]| >= k + 2. Each vertex u ∉ {a} ∪ B ∪ C
satisfies |N[u] ∩ N[v]| <= 2 for any vertex v not in the clique that includes u.
Any vertex pair (u, v) taken from B or from C satisfies |N[u] ∩ N[v]| <= k - 1.
For each v ∈ B, we have |N[a] ∩ N[v]| = 4, and for each v ∈ C, we have
|N[a] ∩ N[v]| = k. Hence, when we apply one iteration of k-intersection
polishing with k > 4, all edges between a and B disappear and all non-edges
between a and C become edges. The edge set of the new graph is different from
that of the original graph, but the new graph is isomorphic to the original.
Therefore, the statement holds. □

It is worth mentioning that this statement also holds if k-intersection
polishing is defined by the usual neighbor rather than the closed neighbor,
i.e., by N(u) ∩ N(v).

We call a graph G polished if it does not change by data polishing process. 
In particular, we call G k-intersection polished if G satisfies G = P^{G). 

Theorem 3. A k-intersection polished graph G = (V, E) has at most $O(|V|^k)$
maximal cliques.

Proof. Suppose that a vertex set S, |S| = k, is included in two distinct cliques
K and K'. Then, for any two vertices u ∈ K and v ∈ K', S ⊆ N[u] and S ⊆ N[v]
hold, thus |N[u] ∩ N[v]| >= k. This implies that such u and v are always
connected by an edge, and thus K ∪ K' is a clique. Hence, we can observe that
any vertex subset S ⊆ V of size k is included in at most one maximal clique.
Since any maximal clique of size at least k includes at least one vertex subset
of size k, we can see that the number of maximal cliques of size at least k is
at most $\binom{n}{k}$. The number of cliques of size at most k - 1 is bounded
by $\sum_{i=1}^{k-1} \binom{n}{i}$, so the number of maximal cliques in G is
bounded by $\binom{n}{k} + \sum_{i=1}^{k-1} \binom{n}{i} = O(n^k)$. Therefore,
the statement holds. □

This theorem demonstrates that k-intersection polishing reduces the number
of maximal cliques. Moreover, a limited number of maximal cliques makes some 
problems easy. For example, we can say that maximal clique enumeration in 
G can be solved in polynomial time in the size of G, since the enumeration is 
done in linear time in the number of maximal cliques [10]. We can also observe a 
tractability on the maximum clique problem. 

Corollary 2. The maximum clique problem in a graph G polished by k-intersection 
polishing can be solved in polynomial time, time in particular. 

Proof. The algorithm MACE[TD] enumerates all maximal cliques in time 
where M is the number of maximal cliques in G. This together with the above 
theorem leads to finding a maximum clique in time. □ 

However, we can observe that some polished graphs have a huge number of 
maximal cliques for large k. The following property demonstrates that there are 
some graphs having exponentially many maximal cliques. 

Theorem 4. There is a series of infinitely many k-intersection polished graphs
that have exponentially many maximal cliques. In particular, a graph of
$n^2/4 + 3n/2$ vertices has at least $2^{n/2}$ maximal cliques for some k.

Proof. Let G be a graph of vertex set V = {1, ..., n} such that any vertex pair
except for 2i - 1 and 2i for i = 1, ..., n/2 is connected by an edge. G is
obtained by removing a perfect matching from a complete graph. We assume that n
is a multiple of 2. The graph has exactly $2^{n/2}$ maximal cliques, since any
choice of one vertex from each non-edge forms a maximal clique. Any u and v
satisfy |N[u] ∩ N[v]| = n - 2 in G. We add vertices and edges to G to obtain a
graph G' satisfying the statement.

Let V_1 = {1, 3, 5, ..., n - 1} and V_2 = {2, 4, 6, ..., n} be a partition of V
into two subsets of equal size. We then consider the sets of vertices U_0 = V_2,
and U_i = (V_1 \ {2i - 1}) ∪ {2i} for 1 <= i <= n/2. We can see that any edge in
G is included in at least one of U_j, 0 <= j <= n/2, and no non-edge is included
in any U_j. For each U_j, we add n/2 vertices and connect edges so that U_j and
the added vertices form a clique. The obtained graph is denoted by
G' = (V', E'). We have $|V'| = n + (n/2 + 1) \times n/2 = n^2/4 + 3n/2$. Observe
that by the addition, a pair of vertices increases its number of common
neighbors by n/2 if and only if the pair is connected by an edge. Thus, the
graph is an n-intersection polished graph. We can see that any maximal clique in
G other than U_j is a maximal clique in G', thus we still have $2^{n/2}$ maximal
cliques. Therefore, the statement holds. □
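The starting point of this proof, a complete graph minus a perfect matching, can be checked by brute force for small n. The sketch below uses a plain Bron-Kerbosch enumeration without pivoting, which is a standard textbook routine chosen here for brevity and is not the enumeration algorithm used in the paper; the function names are hypothetical.

```python
def complete_minus_matching(n):
    """Adjacency sets of K_n on vertices 1..n (n even) without the edges {2i-1, 2i}."""
    adj = {v: set(range(1, n + 1)) - {v} for v in range(1, n + 1)}
    for i in range(1, n // 2 + 1):
        adj[2 * i - 1].discard(2 * i)
        adj[2 * i].discard(2 * i - 1)
    return adj


def maximal_cliques(adj):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, tiny graphs only)."""
    found = []

    def expand(r, p, x):
        if not p and not x:
            found.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return found


for n in (4, 6, 8):
    print(n, len(maximal_cliques(complete_minus_matching(n))))
```

Each maximal clique found this way picks exactly one endpoint of every removed matching edge, which is where the count $2^{n/2}$ comes from.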

For real world data, k-intersection polishing generates acceptable clusters in
some cases, such as when the pseudo cliques are disjoint. However, it often does
not work well if there are many vertices adjacent to several large degree
vertices (hubs), since these vertices form a very large clique even though they
are not related. This property is often observed in many kinds of networks, such
as social networks and Web link networks. The cause is that two vertices are
connected even though many of their neighbors are not common. In such cases,
instead of the k-intersection model, similarity measures such as the Jaccard
coefficient are preferable. By using similarity, vertices adjacent to many
different vertices will not be connected.

We denote the similarity of N[u] and N[v] by sim(u, v), where N[u] and N[v]
are more similar when sim(u, v) is large. For example, if the similarity measure
is the Jaccard coefficient, $sim(u, v) = |N[u] \cap N[v]| / |N[u] \cup N[v]|$.
We define a graph $P^{sim(\theta)}(G)$ in the same manner, i.e.,
$P^{sim(\theta)}(G) = (V, \{(u, v) \mid sim(u, v) \ge \theta\})$.
The operation of computing $P^{sim(\theta)}(G)$ is called θ-sim polishing, and a
graph G is called θ-sim polished if $G = P^{sim(\theta)}(G)$. In particular, if
the similarity measure is the Jaccard coefficient, they are called θ-Jaccard
polishing and θ-Jaccard polished. Unfortunately, even in these cases polishing
may not converge for some graphs.
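A single θ-Jaccard polishing step can be sketched in the same style. The graph representation (a dictionary of neighbor sets), the name `jaccard_polish`, and the example threshold 0.45 are assumptions for illustration only.

```python
from itertools import combinations


def jaccard_polish(adj, theta):
    """One theta-Jaccard polishing step on closed neighborhoods."""
    closed = {v: set(nbrs) | {v} for v, nbrs in adj.items()}
    polished = {v: set() for v in adj}
    for u, v in combinations(sorted(adj), 2):
        inter = len(closed[u] & closed[v])
        union = len(closed[u]) + len(closed[v]) - inter  # never zero: u is in N[u]
        if inter / union >= theta:
            polished[u].add(v)
            polished[v].add(u)
    return polished


# K4 on {0,1,2,3} with the edge (0, 1) missing, plus a pendant vertex 4 attached to 3.
adj = {0: {2, 3}, 1: {2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
print(jaccard_polish(adj, 0.45))
```

Here the pair (0, 1) has Jaccard coefficient 2/4 and gets connected, while the pendant edge (3, 4) has coefficient 2/5 and is removed.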


Theorem 5. For every $\theta \in (0, 1/4] \cup (3/11, 2/7]$, there is a graph
G = (V, E) such that $P^{sim(\theta)}(G) \ne G$ and
$P^{sim(\theta)}(P^{sim(\theta)}(G)) = G$.

Proof. We define a graph $G_{m_1,m_2,n} = (V, E)$ as follows. Let

$V = \{a\} \cup B \cup C \cup \bigcup_{i,j} D_{i,j}$

where $B = \{b_1, \ldots, b_n\}$, $C = \{c_1, \ldots, c_n\}$, $|D_{i,i}| = m_1$,
and $|D_{i,j}| = m_2$ for $i \ne j$. The edge set E is defined so that
$\{a\} \times B$ and $B \times C$ form bicliques and
$D_{i,j} \cup \{b_i, c_j\}$ forms a clique for each i, j. No other vertices are
connected. Then, for each $i, j \in \{1, \ldots, n\}$ and $d \in D_{i,j}$,

$N[a] = \{a\} \cup B$,  $|N[a]| = n + 1$,
$N[b_i] = \{a, b_i\} \cup C \cup \bigcup_j D_{i,j}$,  $|N[b_i]| = m_1 + (n-1)m_2 + n + 2$,
$N[c_j] = \{c_j\} \cup B \cup \bigcup_i D_{i,j}$,  $|N[c_j]| = m_1 + (n-1)m_2 + n + 1$,
$N[d] = D_{i,j} \cup \{b_i, c_j\}$,  $|N[d]| = m + 2$,

where $m = m_1$ if $d \in D_{i,i}$ and $m = m_2$ otherwise. Thus, for
$i' \ne i$, $j' \ne j$, $d' \in D_{i,j} \setminus \{d\}$ and
$d'' \notin \{a\} \cup B \cup C \cup D_{i,j}$,

$N[a] \cap N[b_i] = \{a, b_i\}$,  $sim(a, b_i) = \frac{2}{2n + m_1 + (n-1)m_2 + 1}$,
$N[b_i] \cap N[b_{i'}] = \{a\} \cup C$,  $sim(b_i, b_{i'}) = \frac{n+1}{2m_1 + 2(n-1)m_2 + n + 3}$,
$N[c_j] \cap N[c_{j'}] = B$,  $sim(c_j, c_{j'}) = \frac{n}{2m_1 + 2(n-1)m_2 + n + 2}$,
$N[a] \cap N[c_j] = B$,  $sim(a, c_j) = \frac{n}{m_1 + (n-1)m_2 + n + 2}$,
$N[b_i] \cap N[c_j] = D_{i,j} \cup \{b_i, c_j\}$,  $sim(b_i, c_j) = \frac{m+2}{2m_1 + 2(n-1)m_2 - m + 2n + 1}$,
$N[b_i] \cap N[d] = D_{i,j} \cup \{b_i, c_j\}$,  $sim(b_i, d) = \frac{m+2}{m_1 + (n-1)m_2 + n + 2}$,
$N[c_j] \cap N[d] = D_{i,j} \cup \{b_i, c_j\}$,  $sim(c_j, d) = \frac{m+2}{m_1 + (n-1)m_2 + n + 1}$,
$N[d] \cap N[d'] = D_{i,j} \cup \{b_i, c_j\}$,  $sim(d, d') = 1$,
$|N[d] \cap N[d'']| \le 1$,  $sim(d, d'') \le \frac{1}{\min\{m_1, m_2\} + m_2 + 3}$,

and sim(u, v) = 0 for any other pairs of distinct u and v.

Now we discuss when the graph $P^{sim(\theta)}(G_{m_1,m_2,n})$ is symmetric to
$G_{m_1,m_2,n}$ in the sense that the edges between {a} and B have been removed
and $\{a\} \times C$ has become a biclique while the other edges are preserved.
This occurs when $f(m_1, m_2, n) < \theta \le g(m_1, m_2, n)$ for

$f(m_1, m_2, n) = \max\{ sim(a, b_i), sim(b_i, b_{i'}), sim(c_j, c_{j'}), sim(d, d'') \}$

and

$g(m_1, m_2, n) = \min\{ sim(a, c_j), sim(b_i, c_j), sim(b_i, d), sim(c_j, d), sim(d, d') \}$.

We have

  m_1   m_2   n   f(m_1,m_2,n)   g(m_1,m_2,n)
   1     2    2       3/11           2/7
   2     2    2       3/13           1/4
   2     2    3       2/9            4/17 (> 3/13)
   2     3    2       1/5            2/9
   3     3    3       1/6            3/14 (> 1/5)

and for all $m \ge 3$,

$g(m, m, 3) > g(m+1, m+1, 3) > f(m, m, 3) > f(m+1, m+1, 3)$.

Note that $\lim_{m \to \infty} g(m, m, 3) = 0$. Therefore, for every
$\theta \in (0, 1/4] \cup (3/11, 2/7]$, there is a graph G satisfying the
property of the theorem. □
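The table entries in this proof can be recomputed from the closed-form similarities with exact rational arithmetic. The helper name `f_and_g` is hypothetical; the code assumes that the value m ranges over both m_1 and m_2 wherever a similarity depends on the size of D_{i,j}, and it uses the stated upper bound for sim(d, d'').

```python
from fractions import Fraction as Fr


def f_and_g(m1, m2, n):
    """Evaluate f(m1, m2, n) and g(m1, m2, n) from the formulas in the proof."""
    s = m1 + (n - 1) * m2          # total size of the D sets attached to one b_i (or c_j)
    sizes = {m1, m2}               # possible values of m = |D_{i,j}|
    f = max(
        Fr(2, 2 * n + s + 1),              # sim(a, b_i)
        Fr(n + 1, 2 * s + n + 3),          # sim(b_i, b_i')
        Fr(n, 2 * s + n + 2),              # sim(c_j, c_j')
        Fr(1, min(m1, m2) + m2 + 3),       # upper bound on sim(d, d'')
    )
    g = min(
        [Fr(n, s + n + 2)]                                         # sim(a, c_j)
        + [Fr(m + 2, 2 * s - m + 2 * n + 1) for m in sizes]        # sim(b_i, c_j)
        + [Fr(m + 2, s + n + 2) for m in sizes]                    # sim(b_i, d)
        + [Fr(m + 2, s + n + 1) for m in sizes]                    # sim(c_j, d)
        + [Fr(1)]                                                  # sim(d, d')
    )
    return f, g


for params in [(1, 2, 2), (2, 2, 2), (2, 2, 3), (2, 3, 2), (3, 3, 3)]:
    print(params, f_and_g(*params))
```

Any θ with f < θ <= g for one of these parameter choices gives a graph on which θ-Jaccard polishing alternates between two states.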




Finally, we describe our micro-clustering algorithm and data polishing
algorithm. Here τ is the maximum number of repetitions.

Algorithm Micro-Clustering_Similarity (D: database, θ', θ: thresholds for record
similarity and neighbor similarity)

1. construct the similarity graph G_sim'(D) with threshold θ'

2. G' := GraphPolishing_for_Cluster (G_sim'(D), θ)

3. output all maximal cliques of G'

Algorithm GraphPolishing_for_Cluster (G = (V, E): graph, θ: threshold)

1. for i := 1 to τ

2.   E' := {(u, v) | sim(u, v) >= θ}

3.   if E' = E then break

4.   E := E'

5. end for

6. output G' = (V, E')
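The polishing loop can be sketched in Python as below. This is a minimal illustration rather than the authors' implementation: the function names, the returned convergence flag, and the use of the Jaccard coefficient as the default similarity are assumptions, and the all-pairs loop is the straightforward quadratic method that Section 4 improves on.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard coefficient of two non-empty sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)


def polish(adj, theta, tau, sim=jaccard):
    """Repeat similarity polishing at most tau times.

    `adj` maps each vertex to the set of its neighbors. Returns the final
    graph and a flag telling whether a fixed point was reached.
    """
    for _ in range(tau):
        closed = {v: adj[v] | {v} for v in adj}
        new_adj = {v: set() for v in adj}
        for u, v in combinations(sorted(adj), 2):
            if sim(closed[u], closed[v]) >= theta:
                new_adj[u].add(v)
                new_adj[v].add(u)
        if new_adj == adj:
            return adj, True    # the graph is polished
        adj = new_adj           # continue from the modified graph
    return adj, False


adj = {0: {2, 3}, 1: {2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
graph, converged = polish(adj, theta=0.45, tau=10)
print(graph, converged)
```

The maximal cliques of the returned graph would then be reported as micro-clusters.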

The maximal clique enumeration is done by algorithms such as MACE [10] and
Tomita's algorithm. For the computation of $P^{sim(\theta)}(G)$ and of the
similarity graph for set families (transaction databases), the algorithm
described in the next section is efficient [18].

4 Algorithm for Fast Data Polishing 

A straightforward way to compute sim(u, v) for all pairs of u and v takes at
least $O(|V|^2)$ time. This is very heavy when |V| is large, such as one
million. We cannot avoid this difficulty when the similarity graph or the
polished graph is dense, and thus has $O(|V|^2)$ edges. In fact, this implies
that elements are similar to many others, thus there are only a few clusters.
Such data can be handled by usual clustering algorithms such as k-means.
Micro-clustering aims to find many small clusters, thus such dense graphs are
not interesting, and we assume that polished graphs are sparse. However, even
with this assumption, computation in less than quadratic time is non-trivial.
For efficient computation, we observe the following.

Observation: In many similarity measures for neighbors, sim(u, v) >= θ only
when |N[u] ∩ N[v]| > 0.

If the similarity graph is sparse, the intersection size |N[u] ∩ N[v]| is zero
for almost all pairs of vertices, and only a few pairs satisfy
|N[u] ∩ N[v]| > 0. Moreover, many similarity measures for sets can be computed
quickly with the use of |N[u] ∩ N[v]|, since
|N[u] ∪ N[v]| = |N[u]| + |N[v]| - |N[u] ∩ N[v]|,
|N[u] \ N[v]| = |N[u]| - |N[u] ∩ N[v]|, and so on. Therefore, we describe an
algorithm for computing the intersection of closed neighbors for each pair
having a non-empty intersection. The algorithm is actually a kind of folklore
and is often used in the literature [18]. The algorithm can be seen as a variant
of the transposition of a matrix represented in a sparse manner. We additionally
show a bound for the computation time when the degree sequence of the graph
satisfies Zipf's law. To the best of our knowledge, this is the first
theoretical result that explains why this folklore algorithm is fast in
practice. We begin with the following property.

Footnotes 2, 3: http://research.nii.ac.jp/~uno/codes.html

Property 2. N[u] has a non-empty intersection with N[v] if and only if u and v
are both in N[w] for some w. □

From this property, we can see the following property.

Property 3. |N[u] ∩ N[v]| is equal to the number of vertices w ∈ N[u] such that
v ∈ N[w]. □

We can see that |N[u] ∩ N[v]| for all v can be computed by scanning N[w]
for all w ∈ N[u], i.e., we scan each N[w] one by one. For each v ∈ N[w], we
increase the counter for v; the counter then equals the intersection size after
all scans. This idea is described as the following algorithm.

for each w ∈ N[u] do

  for each v ∈ N[w] s.t. v < u, intersection[v] := intersection[v] + 1

end for

In the algorithm, we want to output only the non-zero intersection sizes. For
this purpose, we use a list $L$ that keeps the vertices having non-empty
intersection with $u$. A vertex $v$ is inserted into $L$ when intersection[v]
is changed from 0 to 1. The following algorithm computes the intersection size
for all pairs by executing the above algorithm for all $u \in V$.

Algorithm NeighborIntersection (G = (V, E))

intersection[u] := 0 for each u ∈ V
for each u ∈ V do
    L := ∅
    for each w ∈ N[u] do
        for each v ∈ N[w] s.t. v < u do
            if intersection[v] = 0 then insert v to L
            intersection[v] := intersection[v] + 1
        end for
    end for
    for each v ∈ L, output “{v, u}, intersection[v]” and set intersection[v] := 0
end for
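The counting scheme of the algorithm above can be sketched in Python as follows. This is a minimal illustration rather than the SSPC implementation: the function name `neighbor_intersection`, the dictionary-of-neighbors input `adj`, and the generator interface are assumptions made for this example. It also assumes that vertices are comparable (for instance integers) and that `adj` is symmetric and has every vertex as a key.

```python
from collections import defaultdict


def neighbor_intersection(adj):
    """Yield (v, u, size) for every pair v < u whose closed neighborhoods intersect.

    adj maps each vertex to an iterable of its neighbors (hypothetical input
    format). The closed neighborhood N[u] is adj[u] together with u itself.
    """
    closed = {u: set(nb) | {u} for u, nb in adj.items()}
    counter = defaultdict(int)           # plays the role of intersection[.]
    for u in sorted(closed):
        touched = []                     # plays the role of the list L
        for w in closed[u]:
            for v in closed[w]:
                if v < u:
                    if counter[v] == 0:
                        touched.append(v)
                    counter[v] += 1
        for v in touched:
            yield v, u, counter[v]
            counter[v] = 0               # reset only the entries that changed


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1]}   # the path 0 - 1 - 2
    for v, u, size in neighbor_intersection(path):
        print(v, u, size)
```

Resetting only the entries recorded in `touched` keeps the work for each `u` proportional to the number of scanned neighbors, instead of clearing the whole counter array for every vertex.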

Note that this algorithm also works for set families (transaction databases)
where a set family (transaction database) $F$ is a collection of subsets of a
set (an itemset) $E$. The computation time of the algorithm is
$O(\sum_{u \in V} \sum_{w \in N[u]} |\{v \in N[w] \mid v < u\}|)$. The term
$\sum_{u \in V} \sum_{w \in N[u]}$ means “for all pairs of $u$ and its closed
neighbors $w$”, and is thus equivalent to $\sum_{w \in V} \sum_{u \in N[w]}$.
Therefore, the computation time is
$O(\sum_{w \in V} \sum_{u \in N[w]} |\{v \in N[w] \mid v < u\}|) = O(\sum_{w \in V} |N[w]|^2)$.

Because we want to understand the practical performance of this algorithm,
we consider the assumption of Zipf’s law. It is known that real-world data,
for example scale-free graphs, often satisfy Zipf’s law. We assume that the
degree sequence of the similarity graph follows Zipf’s law, i.e., the expected
value of the degree of the $i$th vertex is $\alpha / i^{\Delta}$ for some
constants $\alpha$ and $\Delta$. To state a time complexity, we consider the
case in which, for some constants $c$ and $\beta$, any vertex $w$ satisfies
$|N[w]| \le \max\{c\alpha / w^{\Delta}, \beta\}$. Since the time complexity
does not change even if a few vertices violate the condition, this assumption
would be acceptable. We have the following by Zipf’s law.


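$$\sum_{w \in V} |N[w]|^2 \le \sum_{w \in V} \max\{c\alpha / w^{\Delta}, \beta\}^2
  \le \sum_{w \in V} (c\alpha / w^{\Delta})^2 + \sum_{w \in V} \beta^2
  = c^2 \alpha^2 \sum_{w \in V} \frac{1}{w^{2\Delta}} + \beta^2 |V|$$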

Thus, the computation time is bounded by $O(\alpha^2 \log |V|)$ for
$\Delta = 1/2$, and $O(\alpha^2)$ for $\Delta > 1/2$. For example, if $\alpha$,
which is intuitively the number of appearances of the most frequently
appearing items, is $O(|V|^{1/2})$, the algorithm terminates in
$O(|V| \log |V|)$ time if $\Delta = 1/2$, and $O(|V|)$ time if $\Delta > 1/2$.

Theorem 6. For a given similarity graph $G = (V, E)$, algorithm
NeighborIntersection terminates in $O(\alpha^2 \log |V|)$ time for
$\Delta = 1/2$, and $O(\alpha^2)$ time for $\Delta > 1/2$, if the degree of
the $i$th vertex is at most $\max\{c\alpha / i^{\Delta}, \beta\}$, where $c$,
$\beta$, and $\Delta$ are constant numbers.
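The two cases of the theorem come from how the sum of $1/w^{2\Delta}$ grows. The following worked estimate is a sketch, assuming that vertices are indexed $1, \ldots, |V|$ in the order used by the Zipf assumption. It also carries an additive $\beta^2 |V|$ term, which is linear in $|V|$; treating that term as dominated, as it is in the example with $\alpha = O(|V|^{1/2})$, is an assumption of this sketch.

```latex
\begin{align*}
\sum_{w \in V} |N[w]|^2
  &\le \sum_{w=1}^{|V|} \max\{c\alpha / w^{\Delta}, \beta\}^2
   \le c^2 \alpha^2 \sum_{w=1}^{|V|} \frac{1}{w^{2\Delta}} + \beta^2 |V|, \\
\sum_{w=1}^{|V|} \frac{1}{w^{2\Delta}}
  &\le 1 + \int_{1}^{|V|} x^{-2\Delta} \, dx
   = \begin{cases}
       1 + \ln |V| & \text{if } \Delta = 1/2, \\[4pt]
       1 + \dfrac{1 - |V|^{1 - 2\Delta}}{2\Delta - 1}
         \le \dfrac{2\Delta}{2\Delta - 1} & \text{if } \Delta > 1/2.
     \end{cases}
\end{align*}
```

For $\Delta = 1/2$ the sum is a harmonic sum and contributes the $\log |V|$ factor, while for $\Delta > 1/2$ it is bounded by a constant that depends only on $\Delta$.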

An implementation of this algorithm is available at our Website
(http://research.nii.ac.jp/~uno/codes.html). The name of the implementation
is SSPC.

4.1 Equivalence to frequent itemset mining 

A transaction database is a database composed of records such that each record
$T_i$ is a subset of a set of items $I$. The frequency of an itemset
$P \subseteq I$ is the number of transactions including $P$. For a threshold
$\sigma$, $P$ is called a frequent itemset if its frequency is no less than
$\sigma$. The problem of enumerating all frequent itemsets is a fundamental
problem in data mining, and thus has been extensively studied and has many
applications. Consider the transaction database $\{N[u] \mid u \in V\}$ where
$V$ is regarded as $I$. Since the number of transactions including itemset
$\{u, v\}$ is equal to the number of vertices $w$ such that $u, v \in N[w]$,
the frequency of $\{u, v\}$ is equal to $|N[u] \cap N[v]|$. Thus, we can find
all pairs of vertices having non-empty neighbor intersection by enumerating all
frequent itemsets of size two with $\sigma = 1$. In fact, the LCM algorithm
[19, 18] for frequent itemset mining uses an algorithm similar to
NeighborIntersection to compute the frequencies of itemsets $P \cup \{e\}$ for
all $e$, and it has good practical efficiency for real-world large-scale
databases.
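The equivalence can be checked on a small example. The sketch below is not the LCM algorithm; it is a naive pair counter, and the name `pair_frequencies` and the tiny path graph are assumptions chosen for illustration. It treats each closed neighborhood as a transaction and counts the frequency of every itemset of size two.

```python
from collections import Counter
from itertools import combinations


def pair_frequencies(transactions):
    """Count, for every pair of items, the number of transactions containing both."""
    freq = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            freq[pair] += 1
    return freq


# Closed neighborhoods N[w] of the path 0 - 1 - 2, used as transactions.
closed = {0: {0, 1}, 1: {0, 1, 2}, 2: {1, 2}}
freq = pair_frequencies(closed.values())

# The frequency of {u, v} coincides with |N[u] ∩ N[v]| for every listed pair.
for (u, v), f in sorted(freq.items()):
    print((u, v), f, len(closed[u] & closed[v]))
```

Every pair with frequency at least 1 is exactly a pair with non-empty neighbor intersection, which is the set of pairs the polishing step needs to examine.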

5 Maximal Clique Enumeration 

Maximal clique enumeration also has a long history, from the 1970’s [an]. In the
2000’s, research began on practical large-scale data with the growth of
data-centric science. The number of maximal cliques is usually tractable even
for large-scale data in such areas. Recent algorithms such as the Makino-Uno
algorithm [10] (MACE) and Tomita’s [13] can enumerate them in a short time in
such cases. The enumeration frameworks of these algorithms are quite different,
but both work well in practice. As reported in m, Tomita’s algorithm is faster
in dense graphs, and MACE is faster in sparse graphs.

MACE is a depth-first search type algorithm traversing the maximal clique
space. It starts from the lexicographically maximum clique, and moves to another
clique recursively by adding a vertex, removing the vertices not adjacent to it,
and adding addible vertices lexicographically. Duplication is avoided by
introducing a tree-shaped traversal route induced by a parent-child relation
between maximal cliques. Since the algorithm explores no redundant clique, the
computation time per maximal clique is bounded by a polynomial, and the
practical performance is good.

On the other hand, Tomita’s algorithm is a branch and bound type. It starts 
from the empty set, and recursively adds vertices in a lexicographic order. At 
every iteration, it chooses a pivot vertex, and re-orders the vertices so that the 
vertices adjacent to the pivot are located at the latter part. Since any clique 
composed only of vertices adjacent to the pivot is not maximal because we can 
add the pivot to it, we can omit the recursive calls with the addition of these 
vertices, in this ordering. This pruning works well especially in dense graphs. 
When we want to find only maximal cliques of at least a given threshold size, 
we have one more advantage. We can delete vertices of degrees smaller than the 
threshold, even in the subgraph given to the recursive calls. This pruning is not 
applicable to MACE. 
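The pivoting idea can be illustrated with the textbook Bron-Kerbosch recursion with pivoting, which follows the same pruning principle. The sketch below is neither Tomita's implementation nor MACE: the function name `maximal_cliques`, the dictionary-of-sets input, and the pivot rule (the vertex with the most neighbors among the candidates) are assumptions made for this example.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph as a frozenset.

    adj maps each vertex to the set of its neighbors, without self-loops
    (hypothetical input format).
    """
    def expand(r, p, x):
        # r: current clique, p: candidates to add, x: already processed vertices
        if not p and not x:
            yield frozenset(r)
            return
        # Cliques made only of neighbors of the pivot are not maximal,
        # so only the non-neighbors of the pivot need to be branched on.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


if __name__ == "__main__":
    # A triangle 0-1-2 with a pendant vertex 3 attached to 2.
    graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    for clique in maximal_cliques(graph):
        print(sorted(clique))
```

The recursion is written with Python generators for brevity; for large graphs an explicit stack would be needed to avoid the recursion limit.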

6 Experiments 

We now explain the results of our computational experiments. Our implementations
were coded in C, without any sophisticated library such as binary trees. All
experiments were conducted on a standard consumer PC. Note that none of the
implementations used multiple cores. The codes and the instances are available
at our Website (http://research.nii.ac.jp/~uno/codes.html). As test instances,
we prepared three types of data: randomly generated data, business relations
among Japanese companies, and Reuters news articles. As shown below, the results
suggest that our data polishing algorithm satisfies the requirements for
micro-clustering, namely quantity, independence, coverage, granularity, and
rigidity.

The first instances are randomly generated graphs for evaluating how well
clustering methods can find micro-clusters. First, we initialized a graph
with no edges, and randomly put $h$ original cliques of size 30 on the graph, so
that each vertex is included in at most $b$ cliques, by re-choosing a vertex
when that vertex was already included in $b$ cliques. We made graphs of 50,000,
100,000, and 350,000 vertices, and $h$ is set to 1,000, 3,000, and 10,000 for
each case, respectively. $b$ is the multiplicity, i.e., the number of
micro-clusters one vertex can belong to, and in practice it would range from one
to five or a bit more. For each vertex $v$, we randomly re-connect several edges
incident to $v$. For each edge $(v, u)$, with probability $1 - p$, we replace
the end point $u$ by a randomly selected vertex $u'$. Note that this operation
does not change the degree of $v$ but changes the degrees of $u$ and $u'$. We
tried the graphs of all combinations of $b = 1, 2, 4$,
$h = 1{,}000, 3{,}000, 10{,}000$, and $p = 0.3, 0.5, 0.9, 1.0$.
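The instance generation can be sketched as follows. This is an illustration under stated assumptions, not the generator used in the experiments: the function name `planted_clique_graph`, its parameters `size` and `seed`, and the simplification that each clique edge is visited once (rather than once per end point) are all choices made for this example.

```python
import random
from itertools import combinations


def planted_clique_graph(n, h, b, p, size=30, seed=0):
    """Return (cliques, edges) for a random graph with h planted cliques.

    n: number of vertices, h: number of cliques, b: maximum number of cliques
    a vertex may belong to, p: probability that a clique edge is kept
    (with probability 1 - p its end point is replaced by a random vertex).
    """
    # Guard so that enough non-saturated vertices always remain to choose from.
    if n - ((h - 1) * size) // b < size:
        raise ValueError("not enough vertices for h cliques with multiplicity b")
    rng = random.Random(seed)
    load = [0] * n                       # number of cliques each vertex is in
    cliques = []
    for _ in range(h):
        members = set()
        while len(members) < size:
            v = rng.randrange(n)
            if load[v] < b:              # otherwise re-choose a vertex
                members.add(v)
        for v in members:
            load[v] += 1
        cliques.append(frozenset(members))

    edges = set()
    for clique in cliques:
        for v, u in combinations(sorted(clique), 2):
            if rng.random() >= p:        # with probability 1 - p, rewire u
                u = rng.randrange(n)
                while u == v:
                    u = rng.randrange(n)
            edges.add((min(v, u), max(v, u)))
    return cliques, edges


if __name__ == "__main__":
    cliques, edges = planted_clique_graph(n=1000, h=20, b=2, p=0.9)
    print(len(cliques), len(edges))
```

With `p = 1.0` no edge is rewired and every planted clique survives intact, which corresponds to the noise-free rows of the result tables.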

We evaluated accuracy by an F-value-like measure that takes into account both
precision and recall. For each original clique $C$, let $k(C)$ be the maximum of
$|C \cap C'|$ among all clusters $C'$ found by an algorithm, and for each
cluster $C'$, let $k(C')$ be the maximum of $|C \cap C'|$ among all original
cliques $C$. We can then consider the precision and recall as $k(C')/|C'|$ and
$k(C)/|C|$, respectively. Let $P$ be the average of $k(C')/|C'|$ over all
clusters and $R$ be the average of $k(C)/|C|$ over all original cliques. The
F-value-like measure is defined by $(2 \times P \times R)/(P + R)$. The reason
we choose the best match for precision and recall comes from the principle of
data mining. Data mining approaches basically enumerate all candidates. In this
sense, it is preferable if we can find some solutions that correspond to hidden
structures, even though the number of solutions is a bit large.
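The accuracy measure can be sketched in a few lines. This is an illustration, assuming the usual harmonic-mean form $2PR/(P+R)$, which stays within $[0, 1]$ like the accuracies reported in the tables; the function name `f_value` and the representation of cliques and clusters as Python sets are assumptions of this example.

```python
def f_value(original_cliques, clusters):
    """F-value-like accuracy between planted cliques and found clusters.

    Both arguments are non-empty lists of non-empty sets of vertices.
    """
    def best_overlap(s, others):
        # Largest intersection of s with any set in others.
        return max(len(s & o) for o in others)

    # Precision: how much of each found cluster lies in its best-matching clique.
    precision = sum(best_overlap(c, original_cliques) / len(c)
                    for c in clusters) / len(clusters)
    # Recall: how much of each original clique is covered by its best cluster.
    recall = sum(best_overlap(c, clusters) / len(c)
                 for c in original_cliques) / len(original_cliques)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    original = [{1, 2, 3, 4}]
    found = [{1, 2}, {3, 4, 5, 6}]
    print(f_value(original, found))
```

Because each side is scored against its best match only, producing a few extra clusters lowers the average precision but does not by itself prevent a high recall.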

We examined Metis [7], Girvan-Newman [4], DBscan [3], and graph polishing
with the Jaccard coefficient with thresholds 0.07, 0.15, and 0.3. DBscan takes a
sparse distance matrix as input, thus we set the distance of two vertices to one
if they are connected by an edge, and otherwise the distance is not defined. The
parameter “eps” is set to 1.5, and “MinPts” is set to 30 for $b = 1$, and to 50
otherwise. The cluster sizes produced by Girvan-Newman were quite biased, and
the accuracy was not good. The results of Metis and DBscan were similarly poor.
On the contrary, the results of graph polishing were good in many cases. As the
similarity threshold decreased, the number of clusters and the accuracy
increased. In particular, the accuracy was almost 1 in less noisy cases, while
that of Metis was below 0.5. The number of clusters was not so large, even with
threshold 0.07. The accuracy did not change much when the problem sizes or $b$
increased. It is interesting that the accuracy of Metis increased in such cases.

The second instance involved business relation data among Japanese companies,
in which the vertices are the companies, and two companies are connected
when they trade. The data is from the year 2012 and was provided by Teikoku
DataBank Limited, Japan. The numbers of vertices and edges in the graph are
3,282 and 35,168, respectively, and the graph has 32,953 maximal cliques.


We applied our graph polishing algorithm with PMI (pointwise mutual
information) thresholds of 0.5, 0.6, 0.7, 0.8, and 0.9. The PMI of $A$ and $B$,
$A, B \subseteq E$, is a similarity measure defined by
$\log(|A \cap B| \times |E| / (|A| \times |B|))$. The changes in the number of
edges and the number of maximal cliques are summarized in Table 4. The
visualization of the graphs is shown in Figure 3. Many clusters were clarified
as cliques, and some vertices belonged to two or more clusters, while we cannot
see anything in the original graph. Examples of clusters are as follows.

- Toyota Motor, Suzuki Motor, Yamaha Motor, Daihatsu Motor, Mazda Motor,
  Isuzu Motor, Nissan Motor, Hino Motor, Fuji Heavy Industries, Honda Motor,
  Mitsubishi Motor, Sato Shoji, Jidosha Buhin Kogyo

- Nissan Shatai, Aichi Machine Industry, Calsonic Kansei, Sincerity Passion
  Kindness, JFE Container, SNT, Zero, Unipres, Kokusai, Gexeed

- Tsuchiya, Insight, Career Bank, Japan Care Service, Ecomic, JPN Holdings,
  Accordia Golf
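The PMI similarity used for this instance can be computed directly from set sizes. The sketch below is a minimal illustration: the function name `pmi`, the use of the natural logarithm, and the convention of returning negative infinity for disjoint sets are assumptions, since the text does not specify the base of the logarithm or the treatment of empty intersections.

```python
import math


def pmi(a, b, universe_size):
    """Pointwise mutual information log(|A ∩ B| * |E| / (|A| * |B|)).

    a, b: non-empty sets drawn from a universe E with universe_size elements.
    """
    inter = len(a & b)
    if inter == 0:
        return float("-inf")             # assumed convention for disjoint sets
    return math.log(inter * universe_size / (len(a) * len(b)))


if __name__ == "__main__":
    print(pmi({1, 2}, {2, 3}, universe_size=4))   # independent-looking overlap
    print(pmi({1, 2}, {1, 2}, universe_size=8))   # identical sets
```

A value of zero means the two sets overlap exactly as much as expected by chance, and larger values mean a stronger than expected overlap.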


Table 1. Results when b = 1 (acc. = accuracy, #cls. = #clusters)

             Polish 0.07     Polish 0.15     Polish 0.3      Metis           DBscan          Newman
#cls., p     acc.    #cls.   acc.    #cls.   acc.    #cls.   acc.    #cls.   acc.    #cls.   acc.    #cls.
1,000, 0.3   0.993   1,601   0.798   1,552   0.229   4,891   0.072   1,396   0.256     710   0.329      27
3,000, 0.3   0.991   3,217   0.484   5,075   0.146   1,631   0.119   3,299   0.190   2,404   0.158      18
10,000, 0.3  0.993  11,088   0.574  15,970   0.153  10,169   0.305  11,431   0.203   7,868   0.140      40
1,000, 0.5   0.999   1,155   0.998   1,026   0.619   2,772   0.065   1,396   0.381     732   0.312      59
3,000, 0.5   1       4,273   0.991   3,152   0.299  10,375   0.117   3,299   0.315   2,380   0.159      36
10,000, 0.5  1      15,379   0.993  10,390   0.352  37,892   0.302  11,431   0.329   7,952   0.080      87
1,000, 0.9   1       1,173   1       3,364   1       2,929   0.064   1,373   0.798     785   0.451     396
3,000, 0.9   1       4,249   1       4,300   1       3,467   0.114   3,277   0.803   2,379   0.352     636
10,000, 0.9  1      14,585   1      17,472   1      13,106   0.304  11,326   0.799   7,843   0.183   1,001
1,000, 1.0   1         999   1         999   1         999   0.092   1,000   1       1,000   1       1,000
3,000, 1.0   1       2,999   1       2,999   1       2,999   0.154   3,000   1       3,000   1       3,000
10,000, 1.0  1       9,999   1       9,999   1       9,999   0.634  10,000   1      10,000   1      10,000


The first cluster is of all Japanese car manufacturers, a car parts
manufacturer, and a trading company of metal materials. The second is of car
parts manufacturers, trading companies of car parts, and those of transportation
and containers. The third is of human resources, land development, and
investment. Four of them are companies from Hokkaido, a prefecture of Japan. We
can see deep relations among the members according to corporate affairs and
regions. On the other hand, the clusters seemed to cover the related companies;
for example, all motor companies are in one cluster.

The third instance is of news articles from Reuters news, a well-known mass
media outlet. The name of the dataset is RCV1 (English), and the provider is the
National Institute of Standards and Technology (NIST) [8]. The articles are from
20/Aug/1996 to 19/Aug/1997, and the number of articles is 806,791. Each article

Table 2. Results when b = 2 (acc. = accuracy, #cls. = #clusters)

             Polish 0.07     Polish 0.15     Polish 0.3      Metis           DBscan          Newman
#cls., p     acc.    #cls.   acc.    #cls.   acc.    #cls.   acc.    #cls.   acc.    #cls.   acc.    #cls.
1,000, 0.3   0.853   1,694   0.679   1,448   0.197   3,742   0.057   1,194   0.304      63   0.287      17
3,000, 0.3   0.518  15,868   0.463   4,252   0.145   3,871   0.111   2,888   0.269     511   0.267      46
10,000, 0.3  0.616  15,683   0.498  14,389   0.151  15,928   0.371   9,917   0.274   1,517   0.269     113
1,000, 0.5   0.998   2,111   0.944   1,129   0.476   2,217   0.072   1,194   0.417     147   0.286      17
3,000, 0.5   0.999   4,533   0.867   3,740   0.269   6,518   0.110   2,888   0.359     697   0.291      23
10,000, 0.5  1      13,402   0.881  11,880   0.298  22,267   0.386   9,917   0.366   2,218   0.356      31
1,000, 0.9   0.999   1,489   1       4,967   0.896   5,306   0.066   1,183   0.477     270   0.451     396
3,000, 0.9   1      31,289   1       8,136   0.834   9,409   0.111   2,874   0.359     785   0.161     122
10,000, 0.9  1     168,539   1      26,567   0.842  33,678   0.341   9,861   0.372   2,704   0.190     195
1,000, 1.0   0.999   1,303   0.999   1,303   0.999   1,303   0.082     781   0.459     241   0.082      26
3,000, 1.0   1       3,776   1       3,776   1       3,776   0.145   2,117   0.365     711   0.098      23
10,000, 1.0  1      10,758   1      10,758   1      10,758   0.475   7,149   0.375   2,387   0.301      43


Table 3. Results when b = 4 (acc. = accuracy, #cls. = #clusters)

             Polish 0.07     Polish 0.15     Polish 0.3      Metis           DBscan          Newman
#cls., p     acc.    #cls.   acc.    #cls.   acc.    #cls.   acc.    #cls.   acc.    #cls.   acc.    #cls.
1,000, 0.3   0.819   2,162   0.658   1,438   0.198   3,789   0.055   1,166   0.273     131   0.281      20
3,000, 0.3   0.460  23,929   0.454   4,108   0.153   4,441   0.108   2,787   0.215     692   0.236      67
10,000, 0.3  0.543  24,965   0.483  14,053   0.157  16,940   0.511   9,587   0.227   2,111   0.290     152
1,000, 0.5   0.994   2,666   0.916   1,128   0.473   1,973   0.068   1,166   0.329     193   0.325      16
3,000, 0.5   0.995   5,925   0.815   3,853   0.282   5,904   0.113   2,787   0.230     690   0.271      16
10,000, 0.5  0.997  13,653   0.832  12,449   0.304  20,301   0.532   9,587   0.247   2,201   0.377      47
1,000, 0.9   1       2,453   0.997   5,761   0.866   4,830   0.067   1,156   0.363     218   0.235      45
3,000, 0.9   1      14,784   0.989  10,756   0.768   9,262   0.110   2,775   0.207     592   0.172      57
10,000, 0.9  1     254,357   0.991  32,705   0.782  32,676   0.470   9,543   0.230   2,062   0.168     118
1,000, 1.0   1       1,803   0.997   1,939   0.926   1,705   0.081     754   0.371     205   0.138      27
3,000, 1.0   1       5,703   0.985   8,018   0.808   5,735   0.134   1,986   0.237     523   0.319      37
10,000, 1.0  1      12,343   0.988  19,293   0.826  18,565   0.681   6,739   0.260   1,838   0.251      41

Fig. 3. Visualization of the business relation graph and a polished graph 


Table 4. Results on business relation data with different PMI thresholds

PMI        0.9     0.8      0.7      0.6       0.5
#edges      93   1,172   13,786   73,132   232,173
#cliques    59     341      521      343       254


was converted to a “bag of words”, which is the set of words that the article
includes. Even if a word appears two or more times in an article, we consider it
as appearing once; thus no word appears twice in the bag of words of an article.
To remove words appearing quite often, and some special words such as IDs, we
ignored words appearing in at least 1% of the articles, or in fewer than ten
articles.
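The preprocessing described above can be sketched as follows. This is an illustration, not the code used for the experiments: the function name `bags_of_words`, the parameters `max_ratio` and `min_count`, and the whitespace tokenization with lowercasing are assumptions, since the text does not describe how the articles were tokenized.

```python
from collections import Counter


def bags_of_words(articles, max_ratio=0.01, min_count=10):
    """Convert texts to sets of words and drop too frequent or too rare words.

    A word is kept if it appears in at least min_count articles and in fewer
    than max_ratio * (number of articles) articles.
    """
    bags = [set(text.lower().split()) for text in articles]
    # Document frequency: each word counts once per article because bags are sets.
    doc_freq = Counter(word for bag in bags for word in bag)
    limit = max_ratio * len(bags)
    keep = {w for w, c in doc_freq.items() if min_count <= c < limit}
    return [bag & keep for bag in bags]


if __name__ == "__main__":
    texts = ["a b c", "a b d", "a e", "a f"]
    # Small thresholds so that the toy example keeps only the word "b".
    print(bags_of_words(texts, max_ratio=0.75, min_count=2))
```

The resulting sets can then be compared pairwise with the Jaccard coefficient to build the similarity graph.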

We constructed a similarity graph of the articles so that vertices were
articles, and two vertices were connected when the Jaccard coefficient between
the corresponding articles was no less than 0.2. We applied maximal clique
enumeration (MACE), Metis, Girvan-Newman, and our micro-clustering to the graph
and obtained the clusters. MACE did not terminate in one hour, and produced
more than 50 million maximal cliques. The sizes were relatively large on average,
likely due to exponentially many cliques included in a large dense subgraph. The
numbers of clusters in the results of Girvan-Newman, Metis, and micro-clustering
were 25,826, 97,260, and 96,607, respectively. The sizes of clusters by
Girvan-Newman were biased, so that the maximum and second maximum are 240,787
and 75,384, i.e., 30% and 9% of all the articles. At the same time there were
many small clusters of size 2 or 3. The sizes from Metis were not biased; almost
all clusters were 7 or 8 in size, and the maximum was 10. However, they seemed
to be too uniform. We could find several clusters that would be mixtures of
several topics, such as the following two clusters.

TOKYO 1996-11-22 JAPAN: Daimon - 96/97 parent forecast.
TOKYO 1996-11-22 JAPAN: Daimon - 6mth parent results.
TOKYO 1996-08-30 JAPAN: Daimon - 96/97 div forecast.
TOKYO 1997-03-28 JAPAN: Daimon - 96/97 parent forecast.
BRUSSELS, Belgium BELGIUM: EU, Mexico sign cognac and tequila deal.
MANILA 1996-10-31 PHILIPPINES: PHILIPPINE STOCKS - factors to watch - Oct 31.
BRUSSELS 1996-10-03 BELGIUM: WTO finds against Japan on liquor tax - EU official.
WASHINGTON 1996-10-21 USA: PRESS DIGEST - Washington Post Business - Oct 22.
WASHINGTON 1997-03-27 USA: Farallon group raises Strawbridge stake.
OLDWICK, N.J. 1997-04-15 USA: A.M. Best upgrades Exel to A plus from A.
LONDON 1997-06-02 UK: EXEL to buy stake in Lloyds managing agency.
NEW YORK 1997-06-09 USA: UC’NWIN says appoints Niall Duggan as CEO.
PHILADELPHIA 1997-07-14 USA: Strawbridge cuts distribution to holders.
AKRON, Ohio 1996-10-01 USA: ABC Dispensing names Crate CFO.

In the results of our graph polishing, the cluster sizes range from 1 to 292,
and most of the clusters were from 2 to 10 in size. Not many clusters seemed to
be mixtures like those found with Metis. On the other hand, there were many
articles that had the same title, or almost the same title. Their contents must
be deeply related, thus they should be in the same cluster. We selected some
strings composed of several words that were common prefixes of many articles,
and counted the number of clusters including the strings. The results are
summarized in Table 5. Here (1) means that the number of clusters is one, but
the two entries marked (1) correspond to a single merged cluster. We can see
that Metis tended to partition articles that should have been included in the
same cluster. On the other hand, Girvan-Newman tended to gather non-deeply
related articles in a cluster. Our graph polishing tended to do neither, and
seemed to obtain good clusters compared to the two popular clustering methods.
In the examples in Table 5, we showed some clusters that would correspond to
local areas. In such cases, the words used in the articles should be
categorized according to the local areas, such as India, and the result of
Girvan-Newman will be good.

Further, we tried instances taken from Twitter, consisting of tweets including a
specified word, and made records of bags of words. We tried “lunch” as the word,
and applied clustering algorithms. The results were not good, and we could see
no interesting cluster. This would be because the tweets are similar to each
other, and topics change smoothly. For example, there would be records of
“go, restaurant, lunch”, “go, tasty, restaurant, lunch”, “visit, tasty, restaurant,


Table 5. Number of clusters including specified strings

                                              Metis   Girvan-Newman   Polishing
Egyptian pound averages                          18               5           7
INDIA: INDIA GOVERNMENT SECURITIES               37               1           1
India Kothari Pioneer MMMF                        4               1           1
TAIWAN: Taiwan BSPA                              13               1           2
USA: High Plains Wheat                           26               4           4
Eurobonds - Expected new issues -Middle East      3               1           1
Eurobonds - Expected new issues -Asia             1               1           1
Eurobonds - Expected new issues (-central)       44             (1)           1
Eurobonds - Expected new issues -Latam            6             (1)           1



Fig. 4. Unpolished data converted to polished data by data polishing 


lunch”, “visit, tasty, restaurant, lunch, twice”, and so on. This intuitively
implies that the boundaries of the topics are not clear, and that topics are
connected smoothly, in bag-of-words data. In such data, it seems that we can
hardly find good clusters, or there may be no clusters at all.


7 Discussion 

Data polishing can be seen as finding polished data from given unpolished data
(see Figure 4). In the case of graphs, we use a feasible hypothesis based
algorithm for the task. However, other approaches are conceivable, for example
minimizing the difference between the original graph and the polished graph.
This approach has a theoretical advantage for model explanation, but the
minimization problem is a hard optimization problem requiring long computation
time. Although the feasible hypothesis approach cannot give any guarantee
regarding approximation, it ensures that the obtained polished graph is
generated with natural modifications, and thus has less information loss.

Consider a graph made from a clique by subdividing each edge, where a 
subdivision replaces an edge by a path of two edges. In some models and data this 
graph should be recognized as a cluster, but our graph polishing deletes all edges, 
since any vertex pair has at most one common neighbor. On the other hand, 
consider a chain of cliques that overlap neighboring cliques. Graph polishing 
changes this graph to a clique if the size of each overlap is sufficiently large. In a 
usual sense, the cliques should be split if the chain is too long. As we discussed, 
data polishing is not ideal for taking into account global structures. At this point, 
we should consider that the clusters provided from graph polishing are still seeds 
or unification, but are better compared to those from existing algorithms. 
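The subdivided-clique example can be verified on a small instance. The sketch below is an illustration only: the helper names `subdivided_clique` and `max_common_neighbors` are hypothetical, and common neighbors are counted with open neighborhoods (a vertex is not its own neighbor), which is an assumption about how the statement is meant.

```python
from itertools import combinations


def subdivided_clique(n):
    """Adjacency sets of the graph obtained from K_n by subdividing each edge."""
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        mid = ("mid", i, j)              # new vertex placed on the edge {i, j}
        adj[mid] = {i, j}
        adj[i].add(mid)
        adj[j].add(mid)
    return adj


def max_common_neighbors(adj):
    """Largest number of common (open) neighbors over all vertex pairs."""
    return max(len(adj[u] & adj[v]) for u, v in combinations(adj, 2))


if __name__ == "__main__":
    graph = subdivided_clique(6)
    print(len(graph), max_common_neighbors(graph))
```

Since no pair of vertices shares more than one neighbor, any neighbor-based similarity stays low for every pair, which is why such a structure is not recognized as a cluster.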

We can apply clustering algorithms to polished graphs. In our experiments
in another study [12], the clusters of features obtained by these algorithms
on polished graphs increased the accuracy of a machine learning task. Polished
graphs are not only for clustering; there are other possible applications.

Similarity graph construction is a key issue for our graph polishing. The
computational cost is usually large, and sometimes we have no approach to
improve efficiency. In such cases, we can consider the use of approximation.
There are several approaches to find similar elements from the data
approximately and quickly. The obtained similarity graph is different from the
exact one, but the feasible hypothesis should also hold in the approximate
graph. This should hold when the data and clusters are larger, since the members
should have many common neighbors.


8 Conclusion 

We discussed requirements for the enumeration of many relatively small clusters 
of elements that are deeply related to each other, that we call micro-clustering. 
We proposed a new concept of data processing called “data polishing”, and 
formulated the clustering problem by data polishing and clique enumeration. 
Data polishing clarifies the local structures and groups by actively modifying the 
data according to some simple feasible hypothesis. The design of our algorithm is 
simple, and thus easily applicable to many kinds of data. Several statements were 
proved to ensure that we never miss large clusters in the similarity graph. We 
also showed a simple and efficient technique for the data polishing, and showed 
some complexity results for the computation time of the algorithm under the 
assumption of the Zipf distribution. 

The computational experiments demonstrated the efficiency of our data
polishing algorithm regarding accuracy and utility. Clusters provided by data
polishing are usually neither split nor mixed with others, while clusters often
are with existing clustering algorithms. The sizes, quantities, and granularity
of the clusters were good for practice. In previous studies, the clusters in big
data were often used in data analysis algorithms such as machine learning, image
recognition, and prediction. However, the disadvantages in the utility of the
results of existing clustering seem to slow down their practical use. The
development of data polishing would increase the practicality of clusters in
such areas.

Interesting future work would be to develop data polishing models for other 
kinds of data processing such as bi-clustering, segmentation, visualization, flow 
detection, and anonymization. It would also be interesting to improve our graph 
polishing algorithm so that we can deal with non-graph data.
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